Please install the following packages before you start this tutorial:
Please download the GitHub repository for both tutorial days.
For this tutorial, we will analyze a 10x Genomics scRNA-seq dataset profiling cell composition across a time course of human and chimpanzee cerebral organoid development, from pluripotency to four months, using human embryonic stem cell (H9) and iPSC (409b2) lines and chimpanzee iPSC (Sandra/SandraA) lines (Kanton et al. 2019). The dataset therefore contains multiple read count matrices per cell, together with metadata such as the time point.
Differentiation of human and chimpanzee cerebral organoids.
Load the here, dplyr, and Seurat packages.
library("here")
library("dplyr")
library("Seurat")
We start by reading in the data. The readRDS() function reads the serialized output of the Seurat pipeline, returning an S4 Seurat object. The values in assays represent the read count tables, with one row per detected feature (i.e. gene) and one column per cell, e.g. hu[["RNA"]]@counts.
Let’s import the rds files using here()
hu <- readRDS(here("files", "timecourse_human_singlecells_GRCh38.rds"))
hu <- subset(x = hu, downsample = 100, invert = TRUE)
# downsample = 100 selects 100 cells per identity class; invert = TRUE inverts the selection
hu <- readRDS(here("files", "Tutorial_tc_human_integ_RNAassay.rds"))
hu
## An object of class Seurat
## 33694 features across 2000 samples within 1 assay
## Active assay: RNA (33694 features, 4506 variable features)
## 2 dimensional reductions calculated: pca, tsne
ch <- readRDS(here("files", "Tutorial_tc_chimp_integ_RNAassay.rds"))
ch
## An object of class Seurat
## 32856 features across 2499 samples within 1 assay
## Active assay: RNA (32856 features, 3680 variable features)
## 2 dimensional reductions calculated: pca, tsne
Explore the Seurat objects.
head(hu@meta.data, 5)
hu[["RNA"]]@counts[1:10, 1:30]
## 10 x 30 sparse Matrix of class "dgCMatrix"
## [[ suppressing 30 column names 'EC00044', 'EC00060', 'EC00075' ... ]]
##
## RP11-34P13.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## FAM138A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## OR4F5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## RP11-34P13.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## RP11-34P13.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## RP11-34P13.14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## RP11-34P13.9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## FO538757.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## FO538757.2 1 . . . . . . 1 . . . 1 . . 1 . . 1 1 1 . . 1 . . . . . . .
## AP006222.2 . 2 . . . 1 . . . . . . . 1 . . . 1 . . . . . . . . . . . .
head(ch@meta.data, 5)
ch[["RNA"]]@counts[1:10, 200:230]
## 10 x 31 sparse Matrix of class "dgCMatrix"
## [[ suppressing 31 column names 'EC14799', 'EC07259', 'EC07809' ... ]]
##
## MIR1302-2HG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## FAM138A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## OR4F5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AL627309.1 . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . .
## AL627309.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AL627309.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AL627309.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AL732372.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## OR4F29 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AC114498.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
As you can see, these rds files are already pre-processed: the cell lines have been integrated and the linear/non-linear dimensional reductions have already been computed. We’ll now split the datasets so we can practice running the standard Seurat clustering pipeline from the start.
Split each dataset into a list of two Seurat objects by cell line.
Split_hu <- SplitObject(hu, split.by = "line")
Split_hu
## $H9
## An object of class Seurat
## 33694 features across 971 samples within 1 assay
## Active assay: RNA (33694 features, 4506 variable features)
## 2 dimensional reductions calculated: pca, tsne
##
## $h409B2
## An object of class Seurat
## 33694 features across 1029 samples within 1 assay
## Active assay: RNA (33694 features, 4506 variable features)
## 2 dimensional reductions calculated: pca, tsne
Split_ch <- SplitObject(ch, split.by = "line")
Split_ch
## $SandraA
## An object of class Seurat
## 32856 features across 559 samples within 1 assay
## Active assay: RNA (32856 features, 3680 variable features)
## 2 dimensional reductions calculated: pca, tsne
##
## $Sandra
## An object of class Seurat
## 32856 features across 1940 samples within 1 assay
## Active assay: RNA (32856 features, 3680 variable features)
## 2 dimensional reductions calculated: pca, tsne
After splitting the cell lines from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method, "LogNormalize", that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. Normalized values are stored in NorFea_hu$H9[["RNA"]]@data.
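As a sketch of what "LogNormalize" computes (using a hypothetical toy count vector, not the tutorial objects), the per-cell transformation is log1p(counts / total counts × scale factor):

```r
# Toy example: LogNormalize for a single cell, as applied by NormalizeData()
counts <- c(geneA = 3, geneB = 0, geneC = 7)          # hypothetical raw counts
scale.factor <- 1e4                                   # Seurat default
normalized <- log1p(counts / sum(counts) * scale.factor)
round(normalized, 2)                                  # geneA ~ 8.01, geneB = 0, geneC ~ 8.85
```

Zero counts stay zero, which is why the sparse structure of the matrix is preserved in the @data slot.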
We next calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). Focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets. This is implemented in the FindVariableFeatures() function. By default, we return 2,000 features per dataset. These will be used in downstream analysis, like PCA.
The normalization and identification of variable features will be done for each dataset independently, in one command, using lapply().
NorFea_hu <- lapply(X = Split_hu, FUN = function(x) {
x <- NormalizeData(x)
x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})
NorFea_hu
## $H9
## An object of class Seurat
## 33694 features across 971 samples within 1 assay
## Active assay: RNA (33694 features, 2000 variable features)
## 2 dimensional reductions calculated: pca, tsne
##
## $h409B2
## An object of class Seurat
## 33694 features across 1029 samples within 1 assay
## Active assay: RNA (33694 features, 2000 variable features)
## 2 dimensional reductions calculated: pca, tsne
NorFea_hu$H9[["RNA"]]@data[1:10, 1:30]
## 10 x 30 sparse Matrix of class "dgCMatrix"
## [[ suppressing 30 column names 'EC00044', 'EC00060', 'EC00296' ... ]]
##
## RP11-34P13.3 . . . . . . . . .
## FAM138A . . . . . . . . .
## OR4F5 . . . . . . . . .
## RP11-34P13.7 . . . . . . . . .
## RP11-34P13.8 . . . . . . . . .
## RP11-34P13.14 . . . . . . . . .
## RP11-34P13.9 . . . . . . . . .
## FO538757.3 . . . . . . . . .
## FO538757.2 0.5656387 . . . . . 0.3090438 0.4445882 .
## AP006222.2 . 1.005712 0.4071387 . . . . . .
##
## RP11-34P13.3 . . . . . . . . . . . .
## FAM138A . . . . . . . . . . . .
## OR4F5 . . . . . . . . . . . .
## RP11-34P13.7 . . . . . . . . . . . .
## RP11-34P13.8 . . . . . . . . . . . .
## RP11-34P13.14 . . . . . . . . . . . .
## RP11-34P13.9 . . . . . . . . . . . .
## FO538757.3 . . . . . . . . . . . .
## FO538757.2 0.4108021 0.4861095 . . 0.3577626 . . . . . . .
## AP006222.2 0.4108021 . . . . . . . . 0.6489736 0.4776818 .
##
## RP11-34P13.3 . . . . . . . . .
## FAM138A . . . . . . . . .
## OR4F5 . . . . . . . . .
## RP11-34P13.7 . . . . . . . . .
## RP11-34P13.8 . . . . . . . . .
## RP11-34P13.14 . . . . . . . . .
## RP11-34P13.9 . . . . . . . . .
## FO538757.3 . . . . . . . . .
## FO538757.2 0.5001261 . . 0.4881182 . . . 0.2898991 .
## AP006222.2 0.5001261 . . 0.4881182 . 0.7170238 . 0.2898991 0.3503035
VFP_H9 <- VariableFeaturePlot(NorFea_hu$H9)
VFP_H9 <- LabelPoints(plot = VFP_H9, points = head(VariableFeatures(NorFea_hu$H9), 10), repel = TRUE, xnudge = 0, ynudge = 0)
VFP_h409B2 <- VariableFeaturePlot(NorFea_hu$h409B2)
VFP_h409B2 <- LabelPoints(plot = VFP_h409B2, points = head(VariableFeatures(NorFea_hu$h409B2), 10), repel = TRUE, xnudge = 0, ynudge = 0)
VFP_H9
VFP_h409B2
NorFea_ch <- lapply(X = Split_ch, FUN = function(x) {
x <- NormalizeData(x)
x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})
NorFea_ch
## $SandraA
## An object of class Seurat
## 32856 features across 559 samples within 1 assay
## Active assay: RNA (32856 features, 2000 variable features)
## 2 dimensional reductions calculated: pca, tsne
##
## $Sandra
## An object of class Seurat
## 32856 features across 1940 samples within 1 assay
## Active assay: RNA (32856 features, 2000 variable features)
## 2 dimensional reductions calculated: pca, tsne
NorFea_ch$SandraA[["RNA"]]@data[1:10, 200:230]
## 10 x 31 sparse Matrix of class "dgCMatrix"
## [[ suppressing 31 column names 'EC14799', 'EC07259', 'EC07809' ... ]]
##
## MIR1302-2HG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## FAM138A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## OR4F5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AL627309.1 . . . . . . . . 0.4267403 . . . . . . . . . . . . . . . . . . . . .
## AL627309.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AL627309.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AL627309.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AL732372.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## OR4F29 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## AC114498.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
##
## MIR1302-2HG .
## FAM138A .
## OR4F5 .
## AL627309.1 .
## AL627309.3 .
## AL627309.2 .
## AL627309.4 .
## AL732372.1 .
## OR4F29 .
## AC114498.1 .
VFP_SA <- VariableFeaturePlot(NorFea_ch$SandraA)
VFP_SA <- LabelPoints(plot = VFP_SA, points = head(VariableFeatures(NorFea_ch$SandraA), 10), repel = TRUE, xnudge = 0, ynudge = 0)
VFP_S <- VariableFeaturePlot(NorFea_ch$Sandra)
VFP_S <- LabelPoints(plot = VFP_S, points = head(VariableFeatures(NorFea_ch$Sandra), 10), repel = TRUE, xnudge = 0, ynudge = 0)
VFP_SA
VFP_S
Now we see that the number of variable features has changed to 2,000 in each cell line. We will next select features that are repeatedly variable across datasets for integration.
features_h <- SelectIntegrationFeatures(object.list = NorFea_hu)
head(features_h, n = 10)
## [1] "TTR" "HTR2C" "PCP4" "LGALS1" "DLK1" "CXCL14" "COL1A1" "COL3A1"
## [9] "IGFBP7" "PTN"
features_c <- SelectIntegrationFeatures(object.list = NorFea_ch)
head(features_c, n = 10)
## [1] "APOA1" "APOA2" "COL3A1" "APOB" "IGF2" "TTR" "APOA4" "COL1A1"
## [9] "LUM" "DLK1"
We then identify anchors using the FindIntegrationAnchors() function, which takes a list of Seurat objects as input, and use these anchors to integrate the two datasets together with IntegrateData().
anchors_hu <- FindIntegrationAnchors(object.list = NorFea_hu, anchor.features = features_h)
## Scaling features for provided objects
## Finding all pairwise anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 2423 anchors
## Filtering anchors
## Retained 2257 anchors
anchors_hu
## An AnchorSet object containing 4514 anchors between 2 Seurat objects
## This can be used as input to IntegrateData.
anchors_ch <- FindIntegrationAnchors(object.list = NorFea_ch, anchor.features = features_c)
## Scaling features for provided objects
## Finding all pairwise anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 2377 anchors
## Filtering anchors
## Retained 857 anchors
anchors_ch
## An AnchorSet object containing 1714 anchors between 2 Seurat objects
## This can be used as input to IntegrateData.
We then pass these anchors to the IntegrateData() function, which returns a Seurat object.
The returned object will contain a new Assay, which holds an integrated (or ‘batch-corrected’) expression matrix for all cells, enabling them to be jointly analyzed.
hu_integ <- IntegrateData(anchorset = anchors_hu)
## Merging dataset 1 into 2
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
hu_integ
## An object of class Seurat
## 35694 features across 2000 samples within 2 assays
## Active assay: integrated (2000 features, 2000 variable features)
## 1 other assay present: RNA
ch_integ <- IntegrateData(anchorset = anchors_ch)
## Merging dataset 1 into 2
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
ch_integ
## An object of class Seurat
## 34856 features across 2499 samples within 2 assays
## Active assay: integrated (2000 features, 2000 variable features)
## 1 other assay present: RNA
Now, the number of features has also been changed to 2000 in each object and the active assay is set to ‘integrated’. The original unmodified data still resides in the ‘RNA’ assay.
As this integrated assay is already normalized and the variable features have been identified, we can continue with scaling, a linear transformation that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData() function shifts the expression of each gene so that the mean expression across cells is 0, and scales the expression of each gene so that the variance across cells is 1, preventing highly expressed genes from dominating. Since the active assay is now ‘integrated’, the results are stored in hu_integ[["integrated"]]@scale.data.
hu_integ <- ScaleData(hu_integ, verbose = FALSE)
ch_integ <- ScaleData(ch_integ, verbose = FALSE)
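In the tutorial objects the scaled values end up in the scale.data slot of the active assay. As a self-contained sketch of the transformation itself (toy random data, not the tutorial objects), base R’s scale() does per column what ScaleData() does per gene:

```r
# Toy sketch: center each "gene" (row) to mean 0 and scale to unit variance.
# scale() works per column, so we transpose, scale, and transpose back.
set.seed(1)
m <- matrix(rnorm(20, mean = 5, sd = 3), nrow = 4)   # 4 "genes" x 5 "cells"
scaled <- t(scale(t(m)))
round(rowMeans(scaled), 10)                          # all ~0
round(apply(scaled, 1, sd), 10)                      # all 1
```

Note that ScaleData() additionally clips extreme values at scale.max (10 by default), so per-gene variances in the real object can be slightly below 1.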
Next, we perform PCA on the scaled data. By default, only the previously determined variable features are used as input and we define the number of PCs to compute and store (50 by default).
hu_integ <- RunPCA(hu_integ, verbose = FALSE)
hu_integ
## An object of class Seurat
## 35694 features across 2000 samples within 2 assays
## Active assay: integrated (2000 features, 2000 variable features)
## 1 other assay present: RNA
## 1 dimensional reduction calculated: pca
ch_integ <- RunPCA(ch_integ, verbose = FALSE)
ch_integ
## An object of class Seurat
## 34856 features across 2499 samples within 2 assays
## Active assay: integrated (2000 features, 2000 variable features)
## 1 other assay present: RNA
## 1 dimensional reduction calculated: pca
Now, the Seurat object contains one dimensional reduction: pca.
To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a ‘metafeature’ that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset. However, how many components should we choose to include? To answer this question, we’ll consider three approaches below.
Seurat provides several useful ways of visualizing both cells and features that define the PCA, including VizDimLoadings(), DimPlot(), and DimHeatmap(). In particular, DimHeatmap() allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses.
For instance, let’s visualize the first and last PC.
DimHeatmap(hu_integ, dims = c(1,50), cells = 500, balanced = TRUE)
DimHeatmap(ch_integ, dims = c(1,50), cells = 500, balanced = TRUE)
The first PC shows a clear heterogeneity, while the last one doesn’t. The cutoff should be somewhere between PC1 and PC50 (wink wink ;))
As a second approach, we randomly permute a subset of the data (1% by default), rerun PCA to construct a ‘null distribution’ of feature scores, and repeat this procedure. We identify ‘significant’ PCs as those that have a strong enrichment of low p-value features. This implements a statistical test based on a random null model, but is time-consuming for large datasets.
Determine the statistical significance of the PCA scores using JackStraw() and compute the significance of the JackStraw scores using ScoreJackStraw(). The JackStrawPlot() function provides a visualization tool for comparing the distribution of p-values for each PC with a uniform distribution (dashed line). ‘Significant’ PCs will show a strong enrichment of features with low p-values (solid curve above the dashed line).
This takes up to 5 min
JS_hu <- JackStraw(hu_integ, dims = 50, num.replicate = 100)
JS_hu <- ScoreJackStraw(JS_hu, dims = 1:50)
JackStrawPlot(JS_hu, dims = c(1, 50))
## Warning: Removed 3149 rows containing missing values (`geom_point()`).
JS_ch <- JackStraw(ch_integ, dims = 50, num.replicate = 100)
JS_ch <- ScoreJackStraw(JS_ch, dims = 1:50)
JackStrawPlot(JS_ch, dims = c(1, 50))
## Warning: Removed 3063 rows containing missing values (`geom_point()`).
Although both the first and last PCs have a significant p-value (< 0.05), PC1 shows a stronger enrichment of features with low p-values. The cutoff should be somewhere between PC1 and PC50 (wink wink ;))
The last approach is an alternative heuristic method: an Elbow plot, which ranks the principal components by the percentage of variance explained by each one.
ElbowPlot(hu_integ, ndims = 50)
ElbowPlot(ch_integ, ndims = 50)
Which PC should be the threshold to define the cutoff?
Hint 1: Some PCs are difficult to distinguish from background noise for a dataset of this size without prior knowledge; we advise setting the cutoff on the higher side. Hint 2: JackStrawPlot() may not return a clear PC cutoff; focus on a sharp drop-off in significance. Hint 3: Look at how many PCs capture the majority of the true signal in the ElbowPlot().
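One quantitative way to combine these hints (a heuristic sketch, not part of the tutorial pipeline; the 90% and 5% thresholds are arbitrary choices) is to compute the percentage of variance explained per stored PC from the PCA standard deviations:

```r
# Heuristic sketch: smallest PC where the cumulative variance of the stored PCs
# exceeds 90% while the individual PC contributes less than 5%.
pct  <- hu_integ[["pca"]]@stdev^2 / sum(hu_integ[["pca"]]@stdev^2) * 100
cumu <- cumsum(pct)
cutoff <- which(cumu > 90 & pct < 5)[1]
cutoff
```

Treat the result as a starting point and cross-check it against the DimHeatmap(), JackStraw, and Elbow plots above.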
After deciding on the parameter, rerun RunPCA() with the chosen npcs.
hu_integ <- RunPCA(hu_integ, npcs = [number of pca of your choice], verbose = FALSE)
ch_integ <- RunPCA(ch_integ, npcs = [number of pca of your choice], verbose = FALSE)
To cluster the cells, we first construct a KNN graph based on the Euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods. This step is performed with the FindNeighbors() function, which takes as input the previously defined dimensionality of the dataset (the first x PCs, as you decided).
The FindClusters() function iteratively groups cells together, with the goal of optimizing the standard modularity function, and contains a resolution parameter that sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters. The clusters can be found using the Idents() function.
hu_integ <- FindNeighbors(hu_integ, dims = 1:[number of pca of your choice])
hu_integ <- FindClusters(hu_integ, resolution = 0.6)
ch_integ <- FindNeighbors(ch_integ, dims = 1:[number of pca of your choice])
ch_integ <- FindClusters(ch_integ, resolution = 0.6)
## Computing nearest neighbor graph
## Computing SNN
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2000
## Number of edges: 59562
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9179
## Number of communities: 17
## Elapsed time: 0 seconds
## Computing nearest neighbor graph
## Computing SNN
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2499
## Number of edges: 75701
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9114
## Number of communities: 14
## Elapsed time: 0 seconds
Look at the cluster IDs of the first 5 cells and check how many clusters (= levels) were found.
head(Idents(hu_integ), 5)
## EC00044 EC00060 EC00296 EC00606 EC00704
## 11 11 11 11 11
## Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
head(Idents(ch_integ), 5)
## EC00002 EC00070 EC00096 EC00136 EC00148
## 11 11 11 11 11
## Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
Seurat offers two major non-linear dimensional reduction techniques, tSNE and UMAP, to visualize and explore these datasets. The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots. As input to UMAP and tSNE, we’ll use the same PCs as input to the clustering analysis.
hu_integ <- RunUMAP(hu_integ, reduction = "pca", dims = 1:[number of pca of your choice])
ch_integ <- RunUMAP(ch_integ, reduction = "pca", dims = 1:[number of pca of your choice])
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 18:11:59 UMAP embedding parameters a = 0.9922 b = 1.112
## 18:11:59 Read 2000 rows and found 20 numeric columns
## 18:11:59 Using Annoy for neighbor search, n_neighbors = 30
## 18:11:59 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 18:11:59 Writing NN index file to temp file /var/folders/zv/_22zlcmj58n4_8mg523zflfc0000gn/T//RtmpuVk57Z/file222d2c0eeff3
## 18:11:59 Searching Annoy index using 1 thread, search_k = 3000
## 18:12:00 Annoy recall = 100%
## 18:12:00 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 18:12:00 Initializing from normalized Laplacian + noise (using irlba)
## 18:12:00 Commencing optimization for 500 epochs, with 77296 positive edges
## 18:12:03 Optimization finished
## 18:12:03 UMAP embedding parameters a = 0.9922 b = 1.112
## 18:12:03 Read 2499 rows and found 20 numeric columns
## 18:12:03 Using Annoy for neighbor search, n_neighbors = 30
## 18:12:03 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 18:12:03 Writing NN index file to temp file /var/folders/zv/_22zlcmj58n4_8mg523zflfc0000gn/T//RtmpuVk57Z/file222d8ab511e
## 18:12:03 Searching Annoy index using 1 thread, search_k = 3000
## 18:12:03 Annoy recall = 100%
## 18:12:03 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 18:12:04 Initializing from normalized Laplacian + noise (using irlba)
## 18:12:04 Commencing optimization for 500 epochs, with 96180 positive edges
## 18:12:06 Optimization finished
DimPlot(hu_integ, reduction = "umap", group.by = "line")
DimPlot(hu_integ, reduction = "umap", split.by = "line")
DimPlot(hu_integ, reduction = "umap")
DimPlot(ch_integ, reduction = "umap", group.by = "line")
DimPlot(ch_integ, reduction = "umap", split.by = "line")
DimPlot(ch_integ, reduction = "umap")
### tSNE (t-distributed stochastic neighbor embedding)
hu_integ <- RunTSNE(hu_integ, reduction = "pca", dims = 1:[number of pca of your choice])
ch_integ <- RunTSNE(ch_integ, reduction = "pca", dims = 1:[number of pca of your choice])
DimPlot(hu_integ, reduction = "tsne", group.by = "line")
DimPlot(hu_integ, reduction = "tsne", split.by = "line")
DimPlot(hu_integ, reduction = "tsne")
DimPlot(ch_integ, reduction = "tsne", group.by = "line")
DimPlot(ch_integ, reduction = "tsne", split.by = "line")
DimPlot(ch_integ, reduction = "tsne")
## Exercises
Try to cluster the cells with a different number of PCs (10, 15, or even 50!). Do you observe a dramatic difference?
What differences do you observe between UMAP and tSNE plot?
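The first exercise can be sketched like this (10/15/50 are the suggested PC counts; this assumes at least 50 PCs are stored in the pca reduction, and keeps the resolution at 0.6):

```r
# Sketch: re-cluster with different numbers of PCs and compare cluster counts
for (npc in c(10, 15, 50)) {
  tmp <- FindNeighbors(hu_integ, dims = 1:npc, verbose = FALSE)
  tmp <- FindClusters(tmp, resolution = 0.6, verbose = FALSE)
  cat("PCs:", npc, "-> clusters:", nlevels(Idents(tmp)), "\n")
}
```

Working on a copy (tmp) keeps the cluster assignments in hu_integ untouched.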
Markers define clusters via differential expression. By default, FindMarkers() identifies positive and negative markers of a single cluster (specified in ident.1) compared to all other cells. FindAllMarkers() automates this process for all clusters, but you can also test groups of clusters against each other, or against all cells. The FindAllMarkers() function has three important arguments which provide thresholds for determining whether a gene is a marker:
min.pct: The fraction of cells, in either group, in which the feature has to be expressed to be included in the analysis. It is meant to remove lowly expressed genes. The default is 0.1, meaning at least 10% of cells in cluster x, or in all other clusters, must express the gene. The lower the value, the longer the computing time. If it is set too high, genuine markers can be missed, because not every expressed gene is detected in every cell (dropout).
min.diff.pct: The minimum difference between the percentage of cells expressing the gene in the cluster and the percentage of cells expressing it in all other clusters combined. The default is -Inf, meaning no difference threshold is set. The lower the value, the longer the computing time.
logfc.threshold: The minimum log2 fold change of a gene’s average expression in the cluster relative to its average expression in all other clusters combined. The default is 0.25, meaning the average log2 fold change must be at least 0.25. The lower the value, the longer the computing time. If it is set too high, many weak signals could be missed.
For performing differential expression after integration, we switch back to the original RNA assay.
DefaultAssay(hu_integ) <- "RNA"
hu_integ
## An object of class Seurat
## 35694 features across 2000 samples within 2 assays
## Active assay: RNA (33694 features, 0 variable features)
## 1 other assay present: integrated
## 3 dimensional reductions calculated: pca, umap, tsne
DefaultAssay(ch_integ) <- "RNA"
ch_integ
## An object of class Seurat
## 34856 features across 2499 samples within 2 assays
## Active assay: RNA (32856 features, 0 variable features)
## 1 other assay present: integrated
## 3 dimensional reductions calculated: pca, umap, tsne
Now, find markers for every cluster compared to all remaining cells, report only the positive ones.
Markers_hu <- FindAllMarkers(hu_integ, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9
## Calculating cluster 10
## Calculating cluster 11
## Calculating cluster 12
## Calculating cluster 13
## Calculating cluster 14
## Calculating cluster 15
## Calculating cluster 16
Markers_hu
Markers_ch <- FindAllMarkers(ch_integ, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9
## Calculating cluster 10
## Calculating cluster 11
## Calculating cluster 12
## Calculating cluster 13
Markers_ch
Check that no marker has an average log2 fold change lower than 0.25, and how many markers were detected in each cluster.
sum(Markers_hu$avg_log2FC < 0.25)
## [1] 0
Markers_hu %>% count(cluster)
sum(Markers_ch$avg_log2FC < 0.25)
## [1] 0
Markers_ch %>% count(cluster)
Let’s take a look at the marker expression. For that, we’ll collect the top 2 markers per cluster, ranked by average log2 fold change.
top2_hu <- Markers_hu %>%
group_by(cluster) %>%
top_n(n = 2, wt = avg_log2FC)
top2_hu
top2_ch <- Markers_ch %>%
group_by(cluster) %>%
top_n(n = 2, wt = avg_log2FC)
top2_ch
The Seurat package provides several tools for visualizing marker expression. VlnPlot() (shows expression probability distributions across clusters) and FeaturePlot() (visualizes feature expression on a dimensional reduction plot) are the most commonly used visualizations. RidgePlot(), CellScatter(), and DotPlot() are also available.
Here, we look at the expression level of the top 2 markers in cluster 0. When using FeaturePlot(), the default reduction is umap.
VlnPlot(hu_integ, features = top2_hu %>% filter(cluster %in% "0") %>% pull(-1))
VlnPlot(ch_integ, features = top2_ch %>% filter(cluster %in% "0") %>% pull(-1))
RidgePlot(hu_integ, features = top2_hu %>% filter(cluster %in% "0") %>% pull(-1))
## Picking joint bandwidth of 0.0144
## Picking joint bandwidth of 0.022
RidgePlot(ch_integ, features = top2_ch %>% filter(cluster %in% "0") %>% pull(-1))
## Picking joint bandwidth of 0.0478
## Picking joint bandwidth of 0.224
FeaturePlot(hu_integ, features = top2_hu %>% filter(cluster %in% "0") %>% pull(-1))
DimPlot(hu_integ, reduction = "umap")
FeaturePlot(ch_integ, features = top2_ch %>% filter(cluster %in% "0") %>% pull(-1))
DimPlot(ch_integ, reduction = "umap")
FeaturePlot(hu_integ, features = top2_hu %>% filter(cluster %in% "0") %>% pull(-1), reduction = "tsne")
DimPlot(hu_integ, reduction = "tsne")
FeaturePlot(ch_integ, features = top2_ch %>% filter(cluster %in% "0") %>% pull(-1), reduction = "tsne")
DimPlot(ch_integ, reduction = "tsne")
Using DotPlot(), we can get a more comprehensive overview of marker expression by providing the top 2 markers of all clusters.
DotPlot(hu_integ, features = unique(top2_hu %>% pull(-1)), cols = c("blue", "red"), dot.scale = 8) +
RotatedAxis()
DotPlot(ch_integ, features = unique(top2_ch %>% pull(-1)), cols = c("blue", "red"), dot.scale = 8) +
RotatedAxis()
Play with the three major arguments of the FindAllMarkers() function! For each task, check the computing time with Sys.time() and visualize the marker expression with a different number of top markers (2, 5, or even 10!).
Try to include more genes per cluster by lowering the minimum average log2 fold change. Does it take longer than the code in the tutorial? How many markers are detected in each cluster? Do the clusters contain more specific markers?
Try to include fewer genes per cluster by excluding lowly expressed genes, requiring that at least 50% of cells in cluster x, or in all other clusters, express the gene. Does it take longer than the code in the tutorial? How many markers are detected in each cluster? Do the clusters contain more specific markers?
Try to include more cluster-specifically expressed markers by adjusting the minimum difference in the expressing fraction between the two groups. Does it take longer than the code in the tutorial? How many markers are detected in each cluster? Do the clusters contain more specific markers?
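The three tasks above can be sketched as follows (the threshold values are hypothetical examples, and this assumes the RNA assay is still the active assay):

```r
# Sketch: time each FindAllMarkers() variant with Sys.time()
t0 <- Sys.time()
m_more  <- FindAllMarkers(hu_integ, only.pos = TRUE, min.pct = 0.25,
                          logfc.threshold = 0.1)    # lower logfc.threshold -> more genes per cluster
Sys.time() - t0

t0 <- Sys.time()
m_fewer <- FindAllMarkers(hu_integ, only.pos = TRUE,
                          min.pct = 0.5)            # higher min.pct -> drop lowly expressed genes
Sys.time() - t0

t0 <- Sys.time()
m_spec  <- FindAllMarkers(hu_integ, only.pos = TRUE, min.pct = 0.25,
                          min.diff.pct = 0.25)      # min.diff.pct -> more cluster-specific markers
Sys.time() - t0

# Compare how many markers each variant returns per cluster
sapply(list(more = m_more, fewer = m_fewer, specific = m_spec), nrow)
```
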
After identifying cell type markers in each cluster, it is also important to assign a cell type identity to each cluster. Mapping canonical markers to known cell types is still challenging. Depending on the source of the cells (e.g. brain or heart) and the species (e.g. human or Drosophila), various databases and studies provide marker sets. In this tutorial, we’ll use a table of cell types and corresponding markers based on a previous scRNA-seq study.
Import the table containing the marker information.
CTmarkers <- read.table("files/markers.txt", sep = "\t", header = T)
dim(CTmarkers)
## [1] 9871 2
head(CTmarkers)
We’ll collect the top 10 markers per cluster, ranked by average log2 fold change, and compare them with the imported table.
top10_hu <- Markers_hu %>%
group_by(cluster) %>%
top_n(n = 10, wt = avg_log2FC)
top10_hu
top10_ch <- Markers_ch %>%
group_by(cluster) %>%
top_n(n = 10, wt = avg_log2FC)
top10_ch
We’ll now create a custom function to find the best-matching cell type for our markers. For each cluster, the function checks how often the top 10 markers match certain cell types in the imported table. The most frequently matched cell type for each cluster is returned.
CTretrieve <- function(x){
cluster <- c()
for(i in 1:length(levels(x$cluster))){
top10 <- x %>% filter(cluster %in% levels(x$cluster)[i]) %>% pull(-1)
cluster <- c(cluster, CTmarkers %>%
filter(gene %in% top10) %>%
count(cluster) %>%
arrange(desc(n)) %>%
slice(1) %>%
pull(1))
}
return(cluster)
}
CTretrieve(top10_hu)
## [1] "cortical neurons 1" "ventral progenitors and neurons 1"
## [3] "G2M/S dorsal progenitors 1" "neuroectodermal-like cells"
## [5] "NSC/radial glia" "NSC"
## [7] "G2M/S dorsal progenitors 2" "stem cells 1"
## [9] "ventral progenitors and neurons 3" "stem cells 1"
## [11] "G2M/S ventral progenitors and IPs" "stem cells 1"
## [13] "gliogenic/outer radial glia" "choroid plexus"
## [15] "midbrain/hindbrain" "mesenchymal-like cells"
## [17] "mesenchymal-like cells"
CTretrieve(top10_ch)
## [1] "cortical neurons 1" "G2M/S dorsal progenitors 2"
## [3] "G2M/S dorsal progenitors 2" "cortical neurons 1"
## [5] "IPs and early cortical neurons" "stem cells 1"
## [7] "ventral progenitors and neurons 3" "mesenchymal-like cells"
## [9] "cortical neurons 2" "mesenchymal-like cells"
## [11] "mesenchymal-like cells" "stem cells 1"
## [13] "neuroectodermal-like cells" "gliogenic/outer radial glia"
With this, we can rename our clusters. So far, they were 0, 1, 2, …
new.cluster.ids_hu <- CTretrieve(top10_hu)
names(new.cluster.ids_hu) <- levels(hu_integ)
hu_integ <- RenameIdents(hu_integ, new.cluster.ids_hu)
DimPlot(hu_integ, reduction = "umap")
DimPlot(hu_integ, reduction = "tsne")
new.cluster.ids_ch <- CTretrieve(top10_ch)
names(new.cluster.ids_ch) <- levels(ch_integ)
ch_integ <- RenameIdents(ch_integ, new.cluster.ids_ch)
DimPlot(ch_integ, reduction = "umap")
DimPlot(ch_integ, reduction = "tsne")